NOTE: Code Annotations are under each chunk in BOLD
If you ever want to watch the greatest minds of the football world prove that they are elite, look no further than a close game with 2 minutes or less left on the clock. The “2-minute drill” has been a staple tactic for almost as long as the game has been around. Wikipedia defines this style of hurry-up offense as a “high-pressure and fast-paced situational strategy where a team will focus on clock management, maximizing the number of plays available for a scoring attempt before a half (or game) expires.” When teams run the 2-minute drill, expect them to manage the clock with timeouts and with plays that stop the clock from running. A 2-minute drill drive occurs in roughly 1 in 5 games, which is why these drives matter so much.
pbp_2018_2021 <- load_pbp(2018:2021)

nfl_qbr_weekly <- readr::read_csv("https://raw.githubusercontent.com/nflverse/espnscrapeR-data/master/data/qbr-nfl-weekly.csv")

nfl_qbr_weekly <- nfl_qbr_weekly %>%
  filter(season %in% 2018:2021)  # %in%, not ==, so every season is kept rather than recycled

Two_min_drill <- pbp_2018_2021 %>%
  filter(half_seconds_remaining < 120,
         as.numeric(ms(drive_game_clock_start)) < 150)  # drive began with less than 2:30 on the clock
Two_min_drill %>%
  group_by(game_id, drive) %>%
  select(game_id, drive_play_count, ydsnet, drive_game_clock_start, fixed_drive_result, name, posteam, week, drive_start_yard_line, wp) %>%
  mutate(td = ifelse(fixed_drive_result == 'Touchdown', 1, 0),
         fg = ifelse(fixed_drive_result == 'Field goal', 1, 0)) %>%
  mutate(score = ifelse(td + fg == 1, 1, 0))
Two_min_drill %>%
  group_by(game_id, drive) %>%
  mutate(rush_attempt = ifelse(is.na(rush_attempt), 0, rush_attempt),
         pass_attempt = ifelse(is.na(pass_attempt), 0, pass_attempt)) %>%
  summarize(num_plays = n(), num_rush = sum(rush_attempt), num_pass = sum(pass_attempt))
two_min_new <- Two_min_drill %>%
  group_by(game_id, drive) %>%
  mutate(td = ifelse(fixed_drive_result == 'Touchdown', 1, 0),
         fg = ifelse(fixed_drive_result == 'Field goal', 1, 0),
         score = ifelse(td + fg == 1, 1, 0)) %>%
  right_join(nfl_qbr_weekly, by = c("week" = "game_week", "posteam" = "team_abb", "season" = "season")) %>%
  ungroup()
getmode <- function(v) {
  v <- v[!is.na(v)]  # drop NAs before tabulating
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
two_min_new <- two_min_new[!is.na(two_min_new$passer), ]
In this code chunk we read in the play-by-play data and weekly QBR data from the 2018 season to the present. The play-by-play data comes from the package “nflfastR,” listed on the PySport website. It contains every play of every game, with over 300 variables describing each play and the state of the game at that moment. The QBR data comes from the “nflverse” collection. Like the “tidyverse,” it bundles several packages, including one that loads weekly quarterback performance metrics. We load every week from 2018 to today, just like the play-by-play data.
From there, we filtered the play-by-play data down to plays that occurred with 2:30 or less on the game clock in either half. We added 30 seconds to the 2-minute mark because many 2-minute drill drives start before the clock reaches 2:00, and we didn’t want to exclude those drives.
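The 2:30 cutoff depends on converting the “MM:SS” game-clock string into seconds; `lubridate::ms()` does this in the chunk above. A minimal base-R sketch of the same conversion (the helper name `clock_to_seconds` is ours, for illustration only):

```r
# Convert an "MM:SS" game-clock string into total seconds.
clock_to_seconds <- function(clock) {
  parts <- strsplit(clock, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

clock_to_seconds(c("2:30", "1:58"))  # 150 118
```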
To examine drives rather than individual plays, we grouped the data by drive and computed some summary statistics. We then created binary variables indicating whether a drive scored a touchdown, a field goal, or any points at all. Finally, we joined the weekly QBR data to the new 2-minute drill data.
two_min_by_drive <-
  two_min_new %>%
  group_by(drive, game_id.x) %>%
  mutate(run_plays = sum(rush_attempt, na.rm = TRUE),
         pass_plays = sum(pass_attempt, na.rm = TRUE),
         pass_tot_yds = sum(air_yards, na.rm = TRUE),
         completion_perc = (1 - sum(incomplete_pass, na.rm = TRUE) / pass_plays),
         tot_yds = sum(yards_gained, na.rm = TRUE),
         rush_yds_tot = sum(rushing_yards, na.rm = TRUE)) %>%
  mutate(td = ifelse(fixed_drive_result == 'Touchdown', 1, 0),
         fg = ifelse(fixed_drive_result == 'Field goal', 1, 0),
         score = ifelse(td + fg == 1, 1, 0)) %>%
  select(passer, qbr_raw, qbr_total, pass_tot_yds, tot_yds, ydsnet, rush_yds_tot, rusher, completion_perc, run_plays, pass_plays, drive_yards_penalized, drive_game_clock_start, td, fg, score, posteam, drive_start_yard_line, headshot_href, wp, season) %>%
  mutate(yards_to_go_start = ifelse(str_extract(drive_start_yard_line, "[A-Z]+") == posteam, 100 - parse_number(drive_start_yard_line), parse_number(drive_start_yard_line)))
drive_summary_data <- two_min_by_drive %>%
  arrange(game_id.x, drive) %>%
  group_by(game_id.x) %>%
  mutate(td = as.factor(td),
         fg = as.factor(fg),
         score = as.factor(score)) %>%
  summarise_all(last)

drive_summary_data <- drive_summary_data %>%
  mutate(passer = replace(passer, passer == "Aa.Rodgers", "A.Rodgers"))

drive_summary_data
The next dataset groups the data by drive just like the last one, but adds more variables: the number of run plays, number of pass plays, total passing yards, QB completion percentage, total yards, and total rushing yards. Finally, we create a variable giving the number of yards the team must drive to reach the end zone. This one was trickier because the starting field position is stored as a character string showing whose territory the offense is in and the yard line, e.g. “TEN 25.” If the offense matches the abbreviation at the start of the string, we subtract the number that follows from 100 to get the yards left to go; otherwise that number itself is the yards to go.
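The territory check described above can be sketched in base R (hypothetical helper for illustration; the real chunk uses `stringr::str_extract` and `readr::parse_number`):

```r
# Yards to the end zone from a field-position string like "TEN 25",
# given the offense's team abbreviation.
yards_to_go <- function(yard_line, posteam) {
  side <- sub(" .*", "", yard_line)              # territory abbreviation, e.g. "TEN"
  yd   <- as.numeric(sub(".* ", "", yard_line))  # yard-line number, e.g. 25
  ifelse(side == posteam, 100 - yd, yd)          # own territory: 100 - yard line
}

yards_to_go("TEN 25", "TEN")  # 75: the offense is on its own 25
yards_to_go("TEN 25", "BUF")  # 25: the offense is on the opponent's 25
```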
The last bit of cleaning fixes data points related to Aaron Rodgers. The data listed him as both “Aa.Rodgers” and “A.Rodgers,” which would have split his overall stats across two names.
Here you can see the drives that began with 2 minutes and 30 seconds or less left in the first or second half, from 2018 to today. As you can see, teams do not often complete these drives with points: just over 21% of the time they come away with at least 3 points from a field goal, while only about 10% of the time do they reach the end zone for 6.
drive_summary_data %>%
  summarize(num_td_drives = sum(td == 1),
            num_fg_drives = sum(fg == 1),
            num_drives = n(),
            td_success_perc = num_td_drives / num_drives,
            fg_success_perc = num_fg_drives / num_drives) %>%
  dplyr::rename("Number of TD's Scored" = num_td_drives,
                "Number of FG's Scored" = num_fg_drives,
                "Number of Drives" = num_drives,
                "TD Success Rate" = td_success_perc,
                "FG Success Rate" = fg_success_perc) %>%
  kbl() %>%
  kable_paper("striped") %>%
  add_header_above(c("Two Minute Drill Success Rates" = 5))
**Two Minute Drill Success Rates**

| Number of TD’s Scored | Number of FG’s Scored | Number of Drives | TD Success Rate | FG Success Rate |
|---|---|---|---|---|
| 28 | 62 | 289 | 0.0968858 | 0.2145329 |
Here I created a summary table using the “kableExtra” package to show summary statistics for all 2-minute drill drives (number of TDs, number of FGs, number of drives, and TD and FG success rates).
drive_summary_data %>%
  group_by(season) %>%
  summarize(season,
            num_td_drives = sum(td == 1),
            num_fg_drives = sum(fg == 1),
            num_drives = n(),
            td_success_perc = num_td_drives / num_drives,
            fg_success_perc = num_fg_drives / num_drives) %>%
  dplyr::slice(1) %>%
  dplyr::rename("Number of TD's Scored" = num_td_drives,
                "Number of FG's Scored" = num_fg_drives,
                "Number of Drives" = num_drives,
                "TD Success Rate" = td_success_perc,
                "FG Success Rate" = fg_success_perc) %>%
  kbl() %>%
  kable_paper("striped") %>%
  add_header_above(c("2-minute Drills in the last 4 Seasons" = 6))
**2-minute Drills in the last 4 Seasons**

| season | Number of TD’s Scored | Number of FG’s Scored | Number of Drives | TD Success Rate | FG Success Rate |
|---|---|---|---|---|---|
| 2018 | 7 | 12 | 70 | 0.1000000 | 0.1714286 |
| 2019 | 7 | 19 | 83 | 0.0843373 | 0.2289157 |
| 2020 | 10 | 18 | 78 | 0.1282051 | 0.2307692 |
| 2021 | 4 | 13 | 58 | 0.0689655 | 0.2241379 |
This next code chunk does the same thing, but split by season.
Now why should we take a look at this specific aspect of football? A successful 2-minute drill can have a massive impact on a team’s probability of winning the game. Let me remind you of Week 10 in 2020. The Buffalo Bills drove the length of the field in a 2-minute drill of their own. Josh Allen and the Bills capped off the drive by finding the end zone, with Allen slinging a beautiful 40-yard dot to his favorite target, Stefon Diggs, with just over 30 seconds left in the game. Win probability models gave the Bills a 90% chance of having just secured the win, but the Cardinals had other plans.
In a mere 3 plays, the Cardinals marched to the 43-yard line in Bills territory. The rest will go down as one of the greatest plays in NFL history: Murray scrambled out of the pocket and heaved one down the field to a triple-teamed DeAndre Hopkins, who reached up and snagged away the Bills’ hopes. Although a little luck was involved, the Cardinals effectively managed the time they were given and won the game despite the statistics. That is what a great 2-minute drill drive can do for a team on any given week.
Here we can see the quarterbacks with the best rate of scoring either a field goal or a touchdown over the past 4 seasons. Those closer to the top right are the most efficient once the clock strikes 2:00.
drive_summary_data %>%
  group_by(passer) %>%
  summarize(num_successful = sum(score == 1),
            num_2min_drives = n(),
            success_perc = num_successful / num_2min_drives,
            headshot = headshot_href) %>%
  arrange(desc(num_successful)) %>%
  dplyr::slice(1) %>%
  ggplot(aes(x = num_2min_drives, y = num_successful)) +
  geom_text_repel(aes(label = passer)) +
  geom_image(aes(image = headshot), size = 0.05, asp = 16/9) +
  labs(x = "Number of 2-minute drill drives",
       y = "Number of successful 2-minute drill drives",
       title = "Every QB's 2-Minute Drill Efficiency from 2018-2021",
       caption = "Data: @nflfastR") +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 8)) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 15)) +
  theme_bw()
(Ben) This may be the plot I am most proud of creating. Using each player’s headshot, I plotted the number of successful 2-minute drill drives each QB in the league has completed against how many such drives he has attempted.
Few positions carry as much responsibility as NFL quarterback; the only comparable position in team sports, to me, is MLB pitcher. As you can see from these two plots, the better the QB plays that week, the higher the probability the team completes the 2-minute drill with new points on the board. Every decision a quarterback makes during those final 2 minutes can have massive repercussions, so it’s important the team has the right guy for the job.
drive_summary_data %>%
  ggplot(aes(x = qbr_total, y = wp, color = score)) +  # map color to score so the manual scale applies
  facet_wrap(~score) +
  geom_point() +
  geom_smooth() +
  labs(x = "Total QBR for that week",
       y = "Current Win Probability during the Drive",
       title = "Does scoring (TD or FG) on a 2-Minute Drill Drive increase Win Probability?",
       caption = "Data: @nflfastR",
       color = "Scored?") +
  scale_color_manual(labels = c("No", "Yes"), values = c("blue", "red"))
# qbr // score
drive_summary_data %>%
  ggplot(aes(x = qbr_total, y = score)) +
  geom_boxplot()
We kept these two plots simple because we didn’t want to crowd out their overall message: that quarterback play affects the probability of the team winning. The first is a scatter plot of all drives, faceted by whether they scored, with a smoothed trend line on top; the higher trend line for scoring drives indicates a greater win probability. The second plot shows that quarterbacks who led a successful 2-minute drive tended to have a better performance on the week overall.
We will explore the relationship between the football stats in our dataset and the score outcome of 2-minute drills from the 2018-2021 seasons.
We will do so primarily with a logistic regression, and will also explore random forests and boosted trees for completeness.
two_min_test <-
  two_min_new %>%
  group_by(drive, game_id.x) %>%
  mutate(run_plays = sum(rush_attempt, na.rm = TRUE),
         pass_plays = sum(pass_attempt, na.rm = TRUE),
         pass_tot_yds = sum(air_yards, na.rm = TRUE),
         completion_perc = (1 - sum(incomplete_pass, na.rm = TRUE) / pass_plays),
         tot_yds = sum(yards_gained, na.rm = TRUE),
         rush_yds_tot = sum(rushing_yards, na.rm = TRUE)) %>%
  mutate(td = ifelse(fixed_drive_result == 'Touchdown', 1, 0),
         fg = ifelse(fixed_drive_result == 'Field goal', 1, 0),
         score = ifelse(td + fg == 1, 1, 0)) %>%
  select(qbr_raw, qbr_total, pass_tot_yds, tot_yds, ydsnet, rush_yds_tot, completion_perc, run_plays, pass_plays, drive_yards_penalized, drive_game_clock_start, td, fg, score, posteam, drive_start_yard_line)
two_min_by_drive <-
  two_min_test %>%
  mutate(yards_to_go_start = ifelse(str_extract(drive_start_yard_line, "[A-Z]+") == posteam, 100 - parse_number(drive_start_yard_line), parse_number(drive_start_yard_line)))

two_min_by_drive
drive_summary_data <- two_min_by_drive %>%
  arrange(game_id.x, drive) %>%
  group_by(game_id.x) %>%
  mutate(td = as.factor(td),
         fg = as.factor(fg),
         score = as.factor(score),
         drive_game_clock_start = as.numeric(ms(drive_game_clock_start))) %>%
  summarise_all(last)

drive_summary_data
drive_summary_data$yards_to_go_start[is.na(drive_summary_data$yards_to_go_start)] <- 50
drive_summary_data %>%
  select(game_id.x, drive_game_clock_start, completion_perc, yards_to_go_start)
This code segments our dataset further. By grouping by drive and game id, we can now treat each 2-minute drive as one observation rather than a collection of plays. We also created some new variables that sum and do basic math on the variables we already have, including our important score indicators as binary outputs (td, fg, and score, which means td or fg).
One of the most difficult cleaning tasks was creating the yards-to-go-start variable: the yards left to the end zone at the start of the drive. In its raw form, the field position is a team abbreviation followed by a yard line, since the field counts 0 up to 50 and back down to 0. The new variable strips the leading characters and uses them to decide how to calculate yards to go.
We then created the primary dataset for our analysis, drive summary data, which includes only the variables we want to use to predict our outcome, score. We tidied it by arranging the drives in a logical order, still grouped so each row represents one observation, with small factor and numeric transformations where needed.
Another difficult task was dealing with NAs. These last couple of lines handle NAs in yards to go start: they arose from drives that started at midfield, where the field position was simply “50” with no team abbreviation to parse. Thus, I replaced those values with the appropriate 50.
Our primary model will be a logistic regression. This model predicts probabilities of the outcome, not just the outcome itself, so we can see not only whether it predicts a score but also the probability behind each prediction.
When building this model, we will use a lasso approach, which stands for “Least Absolute Shrinkage and Selection Operator.” In short, this type of model selects variables and shrinks their coefficients while balancing the accuracy and interpretability of the model.
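The probabilities come from the logistic (inverse-logit) link, which maps any linear combination of predictors onto the 0-1 scale. A one-line base-R illustration (helper name is ours):

```r
# Inverse-logit link: converts a linear predictor into a probability.
inv_logit <- function(eta) 1 / (1 + exp(-eta))

inv_logit(0)  # 0.5: a linear predictor of zero is a coin flip
inv_logit(3)  # ~0.95: large positive predictors approach certainty
```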
drive_summary_data$completion_perc[is.na(drive_summary_data$completion_perc)] <- 0
drive_summary_data$ydsnet[is.na(drive_summary_data$ydsnet)] <- 0
drive_summary_data$drive_yards_penalized[is.na(drive_summary_data$drive_yards_penalized)] <- 0
drive_summary_data$drive_game_clock_start[is.na(drive_summary_data$drive_game_clock_start)] <- 0
drive_summary_data$score[is.na(drive_summary_data$score)] <- 0
drive_summary_data <-
  drive_summary_data %>%
  select(-posteam, -td, -fg, -game_id.x, -drive, -drive_start_yard_line)

drive_summary_data %>%
  add_n_miss() %>%
  count(n_miss_all)
We continue working through pesky NA values. We handled them similarly to above, except these NAs appeared when the event in question simply did not occur on the drive, so we replaced them with 0 where appropriate.
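The replace-with-zero idiom used above, shown on a toy vector (tidyr::replace_na is an equivalent shortcut):

```r
x <- c(3, NA, 7, NA)   # e.g. rushing yards, NA where no rush occurred
x[is.na(x)] <- 0       # same base-R idiom as the chunk above
x                      # 3 0 7 0
```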
We then continued shaping the analysis dataset by removing variables without predictive power, either by their nature or because they are character-structured (team, game id, drive, drive start line), along with variables too closely tied to our outcome (td, fg).
We lastly checked to ensure our NA subbing worked, which it did.
set.seed(2)

drive_two_min_split <- initial_split(drive_summary_data,
                                     prop = .75, strata = score)

drive_two_min_training <- training(drive_two_min_split)
drive_two_min_testing <- testing(drive_two_min_split)
We began our modeling work by setting a seed to guarantee the same split no matter how many times the chunk runs. We split our data into training and testing sets so we have held-out data to test the model on and prevent overfitting.
drive_two_min_training %>%
  count(score)
To prep for our recipe and modeling, we wanted to peek at the distribution of our outcome variable, score, to see if we needed to upsample a lopsided split (which it is).
set.seed(2)

lasso_recipe <- recipe(score ~ .,
                       data = drive_two_min_training) %>%
  step_upsample(score, over_ratio = 1) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_predictors(), -all_outcomes())
We set a seed again for reproducibility. We proceed with our recipe, a necessary part of building the model, stating the preprocessing we want to perform and transforming variables to what they need to be.
This model will use all variables from our training set to predict our score outcome, as seen above. We adjusted the outcome by upsampling to make the split between score 0 and score 1 more even; had we not, the model might simply have chosen the majority class every time because of how simple and accurate that would be, which makes for a terrible model. Additionally, we dummy-coded all nominal variables and normalized all predictors so they are not skewed, minus the outcome variable of course.
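What step_upsample(over_ratio = 1) does can be sketched by hand in base R: resample the minority class with replacement until the two classes match (toy outcome vector; themis does this inside the recipe):

```r
set.seed(2)
score <- c(rep(0, 8), rep(1, 2))   # lopsided outcome, like our training data
minority <- which(score == 1)
extra <- sample(minority, sum(score == 0) - length(minority), replace = TRUE)
balanced <- c(score, score[extra]) # append resampled minority rows
table(balanced)                    # both classes now appear 8 times
```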
We pipe into our recipe to peek at how it transforms our training data, ensuring it is what we intended, with some table styling from kable. It is clear there has been plenty of normalizing and dummy-coding, as desired.
lasso_mod <-
  logistic_reg(mixture = 1) %>%
  set_engine("glmnet") %>%
  set_args(penalty = tune()) %>%
  set_mode("classification")

lasso_wf <- workflow() %>%
  add_recipe(lasso_recipe) %>%
  add_model(lasso_mod)

lasso_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## • step_upsample()
## • step_dummy()
## • step_normalize()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = tune()
## mixture = 1
##
## Computational engine: glmnet
We create our model here, specifying the type, logistic regression, and its engine, glmnet. We mark the penalty as a tuning parameter; it is specific to lasso and governs how variables are shrunk and regularized. Lastly, we set the mode to classification because score is binary, 0 or 1.
We also create our workflow, which combines our model and recipe together into one.
set.seed(2)

cv_split <- vfold_cv(drive_two_min_training,
                     v = 5)

penalty_grid <- grid_regular(penalty(),
                             levels = 10)
penalty_grid

lasso_tune <-
  lasso_wf %>%
  tune_grid(resamples = cv_split,
            grid = penalty_grid,
            control = control_stack_grid())
We set our seed again for reproducibility and split the training data into 5 cross-validation folds, a pretty standard choice.
We also establish a penalty grid of candidate penalty values to tune over. We feed this into our lasso tune, which uses the workflow (model plus recipe together) to find the penalty parameter that yields the most accurate, best-predicting model. We specify the 5 folds here and set a control grid to ensure our settings apply to all resampling objects.
Below are the results of the 5-fold cross-validation. In short, we separated the data into two groups: one to build the model on (75%) and the other to test how well the model works.
Each fold builds the model on a different part of that 75% training set and tests on the rest. Repeating this across folds amasses plenty of evidence about the best logistic regression model.
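The fold assignment behind vfold_cv can be sketched in base R: shuffle fold ids across the rows, then each fold in turn serves as the assessment set (illustrative numbers, not our actual split):

```r
set.seed(2)
n <- 20; v <- 5
fold <- sample(rep(1:v, length.out = n))  # each row gets one of 5 fold ids
assess_rows <- which(fold == 1)           # fold 1 is held out...
train_rows  <- which(fold != 1)           # ...and the model trains on the rest
length(assess_rows)                       # 4 rows assessed per fold
```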
lasso_tune %>%
  select(id, .metrics) %>%
  unnest(.metrics) %>%
  filter(.metric == "accuracy") %>%
  kbl() %>%
  kable_paper("striped") %>%
  add_header_above(c("Lasso Log Reg Models Performance" = 6))
**Lasso Log Reg Models Performance**

| id | penalty | .metric | .estimator | .estimate | .config |
|---|---|---|---|---|---|
| Fold1 | 0.0000000 | accuracy | binary | 0.7272727 | Preprocessor1_Model01 |
| Fold1 | 0.0000000 | accuracy | binary | 0.7272727 | Preprocessor1_Model02 |
| Fold1 | 0.0000000 | accuracy | binary | 0.7272727 | Preprocessor1_Model03 |
| Fold1 | 0.0000002 | accuracy | binary | 0.7272727 | Preprocessor1_Model04 |
| Fold1 | 0.0000028 | accuracy | binary | 0.7272727 | Preprocessor1_Model05 |
| Fold1 | 0.0000359 | accuracy | binary | 0.7272727 | Preprocessor1_Model06 |
| Fold1 | 0.0004642 | accuracy | binary | 0.7272727 | Preprocessor1_Model07 |
| Fold1 | 0.0059948 | accuracy | binary | 0.7272727 | Preprocessor1_Model08 |
| Fold1 | 0.0774264 | accuracy | binary | 0.7272727 | Preprocessor1_Model09 |
| Fold1 | 1.0000000 | accuracy | binary | 0.7272727 | Preprocessor1_Model10 |
| Fold2 | 0.0000000 | accuracy | binary | 0.8604651 | Preprocessor1_Model01 |
| Fold2 | 0.0000000 | accuracy | binary | 0.8604651 | Preprocessor1_Model02 |
| Fold2 | 0.0000000 | accuracy | binary | 0.8604651 | Preprocessor1_Model03 |
| Fold2 | 0.0000002 | accuracy | binary | 0.8604651 | Preprocessor1_Model04 |
| Fold2 | 0.0000028 | accuracy | binary | 0.8604651 | Preprocessor1_Model05 |
| Fold2 | 0.0000359 | accuracy | binary | 0.8604651 | Preprocessor1_Model06 |
| Fold2 | 0.0004642 | accuracy | binary | 0.8837209 | Preprocessor1_Model07 |
| Fold2 | 0.0059948 | accuracy | binary | 0.9069767 | Preprocessor1_Model08 |
| Fold2 | 0.0774264 | accuracy | binary | 0.9069767 | Preprocessor1_Model09 |
| Fold2 | 1.0000000 | accuracy | binary | 0.6046512 | Preprocessor1_Model10 |
| Fold3 | 0.0000000 | accuracy | binary | 0.7906977 | Preprocessor1_Model01 |
| Fold3 | 0.0000000 | accuracy | binary | 0.7906977 | Preprocessor1_Model02 |
| Fold3 | 0.0000000 | accuracy | binary | 0.7906977 | Preprocessor1_Model03 |
| Fold3 | 0.0000002 | accuracy | binary | 0.7906977 | Preprocessor1_Model04 |
| Fold3 | 0.0000028 | accuracy | binary | 0.7906977 | Preprocessor1_Model05 |
| Fold3 | 0.0000359 | accuracy | binary | 0.7906977 | Preprocessor1_Model06 |
| Fold3 | 0.0004642 | accuracy | binary | 0.7906977 | Preprocessor1_Model07 |
| Fold3 | 0.0059948 | accuracy | binary | 0.7906977 | Preprocessor1_Model08 |
| Fold3 | 0.0774264 | accuracy | binary | 0.7674419 | Preprocessor1_Model09 |
| Fold3 | 1.0000000 | accuracy | binary | 0.1860465 | Preprocessor1_Model10 |
| Fold4 | 0.0000000 | accuracy | binary | 0.8837209 | Preprocessor1_Model01 |
| Fold4 | 0.0000000 | accuracy | binary | 0.8837209 | Preprocessor1_Model02 |
| Fold4 | 0.0000000 | accuracy | binary | 0.8837209 | Preprocessor1_Model03 |
| Fold4 | 0.0000002 | accuracy | binary | 0.8837209 | Preprocessor1_Model04 |
| Fold4 | 0.0000028 | accuracy | binary | 0.8837209 | Preprocessor1_Model05 |
| Fold4 | 0.0000359 | accuracy | binary | 0.8837209 | Preprocessor1_Model06 |
| Fold4 | 0.0004642 | accuracy | binary | 0.8837209 | Preprocessor1_Model07 |
| Fold4 | 0.0059948 | accuracy | binary | 0.8604651 | Preprocessor1_Model08 |
| Fold4 | 0.0774264 | accuracy | binary | 0.8139535 | Preprocessor1_Model09 |
| Fold4 | 1.0000000 | accuracy | binary | 0.6511628 | Preprocessor1_Model10 |
| Fold5 | 0.0000000 | accuracy | binary | 0.8837209 | Preprocessor1_Model01 |
| Fold5 | 0.0000000 | accuracy | binary | 0.8837209 | Preprocessor1_Model02 |
| Fold5 | 0.0000000 | accuracy | binary | 0.8837209 | Preprocessor1_Model03 |
| Fold5 | 0.0000002 | accuracy | binary | 0.8837209 | Preprocessor1_Model04 |
| Fold5 | 0.0000028 | accuracy | binary | 0.8837209 | Preprocessor1_Model05 |
| Fold5 | 0.0000359 | accuracy | binary | 0.8837209 | Preprocessor1_Model06 |
| Fold5 | 0.0004642 | accuracy | binary | 0.8837209 | Preprocessor1_Model07 |
| Fold5 | 0.0059948 | accuracy | binary | 0.9069767 | Preprocessor1_Model08 |
| Fold5 | 0.0774264 | accuracy | binary | 0.8837209 | Preprocessor1_Model09 |
| Fold5 | 1.0000000 | accuracy | binary | 0.6511628 | Preprocessor1_Model10 |
Here we pipe into our tune to look at how the models across the five folds performed, based on their accuracy estimates at each penalty parameter.
Those folds and their resulting accuracies were used to try out different penalty parameters for the lasso, which here determines the cutoff for shrinking away “indeterminate” variables. The accuracies appear quite high for most penalties.
lasso_tune %>%
  collect_metrics() %>%
  filter(.metric == "accuracy") %>%
  ggplot(aes(x = penalty, y = mean)) +
  geom_point() +
  geom_line() +
  scale_x_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),
                labels = scales::trans_format("log10", scales::math_format(10^.x))) +
  labs(x = "penalty", y = "accuracy")
Using our lasso tune, we collect and plot the accuracy values for the different penalty parameters, visualizing the smallest penalty we can use while still maximizing accuracy.
The goal of modeling is to extract our most accurate model and its associated penalty so that we can finalize the model and interpret some meaning.
lasso_tune %>%
  show_best(metric = "accuracy") %>%
  kbl() %>%
  kable_paper("striped") %>%
  add_header_above(c("Lasso Log Reg Best Accuracy Models" = 7))
**Lasso Log Reg Best Accuracy Models**

| penalty | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|
| 0.0059948 | accuracy | binary | 0.8384778 | 5 | 0.0350123 | Preprocessor1_Model08 |
| 0.0004642 | accuracy | binary | 0.8338266 | 5 | 0.0321576 | Preprocessor1_Model07 |
| 0.0000000 | accuracy | binary | 0.8291755 | 5 | 0.0306547 | Preprocessor1_Model01 |
| 0.0000000 | accuracy | binary | 0.8291755 | 5 | 0.0306547 | Preprocessor1_Model02 |
| 0.0000000 | accuracy | binary | 0.8291755 | 5 | 0.0306547 | Preprocessor1_Model03 |
We also pipe into our tune to see our best accuracy metric and its associated penalties and model designations.
The best penalty which maximized our accuracy is below and will be directly input to finalize our model.
best_param <- lasso_tune %>%
  select_best(metric = "accuracy")

best_param %>%
  kbl() %>%
  kable_paper("striped") %>%
  add_header_above(c("Lasso Log Reg Best Model" = 2))
**Lasso Log Reg Best Model**

| penalty | .config |
|---|---|
| 0.0059948 | Preprocessor1_Model08 |
We home in on the single best model by highest accuracy, and pull its penalty parameter to use as the fixed penalty in the finalized model to come.
lasso_final_wf <- lasso_wf %>%
  finalize_workflow(best_param)

lasso_final_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 3 Recipe Steps
##
## • step_upsample()
## • step_dummy()
## • step_normalize()
##
## ── Model ───────────────────────────────────────────────────────────────────────
## Logistic Regression Model Specification (classification)
##
## Main Arguments:
## penalty = 0.00599484250318942
## mixture = 1
##
## Computational engine: glmnet
Using the penalty found above by best accuracy performance, we pipe into the workflow (model plus recipe) and replace the previously tuned penalty parameter with this best value.
Now for some interpretation and meaning. Below are the coefficients of each variable from the logistic regression. Normally in log-odds form, we exponentiated them so they are now on the odds and odds-ratio scale, a much more interpretable style.
lasso_final_mod <- lasso_final_wf %>%
  fit(data = drive_two_min_training)

lasso_final_mod %>%
  pull_workflow_fit() %>%
  tidy() %>%
  select(-penalty) %>%
  mutate(estimate = exp(estimate)) %>%  # for odds ratios
  kbl() %>%
  kable_paper("striped") %>%
  add_header_above(c("Lasso Log Reg Finalized Model Coefficients" = 2))
**Lasso Log Reg Finalized Model Coefficients**

| term | estimate |
|---|---|
| (Intercept) | 0.7093673 |
| qbr_raw | 1.7764990 |
| qbr_total | 1.0000000 |
| pass_tot_yds | 0.8238078 |
| tot_yds | 2.9777486 |
| ydsnet | 6.9298994 |
| rush_yds_tot | 1.0738806 |
| completion_perc | 1.0000000 |
| run_plays | 0.9714654 |
| pass_plays | 1.0000000 |
| drive_yards_penalized | 2.1472343 |
| drive_game_clock_start | 0.5105232 |
| yards_to_go_start | 0.2084632 |
With this finalized workflow, we create a finalized model by fitting it to our training data.
Most importantly, we grabbed the model’s variable coefficients and exponentiated them, taking them from a log-odds interpretation to an odds and odds-ratio one, so we can make more sense of them.
Above, you can see the logistic regression model output for all of our variables and their coefficients.
Generally, if a coefficient is above one (1) it shows a positive impact upon the likelihood of a two minute drill drive resulting in success.
A coefficient value of one (1) indicates no change. And a coefficient value of less than one (1) means an increase in that variable results in a lesser chance of scoring.
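The exponentiation that produced the estimates above, shown on a made-up coefficient:

```r
beta <- log(2)           # a log-odds coefficient of about 0.693 (made-up value)
odds_ratio <- exp(beta)  # back to the odds-ratio scale
odds_ratio               # 2: a one-unit (here, one-SD) increase doubles the odds of scoring
```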
Variables that positively impact the likelihood of scoring (odds ratio above 1): qbr_raw, tot_yds, ydsnet, rush_yds_tot, and drive_yards_penalized.
Variables that negatively impact the likelihood of scoring (odds ratio below 1): pass_tot_yds, run_plays, drive_game_clock_start, and yards_to_go_start.
You can see yards net having the largest impact on the score outcome in the Variable Importance plot below.
Its massive positive impact is nearly matched by the massive negative impact of the yards-to-go variable, which had the most negative coefficient above.
You may notice more variables on the VI plot than coefficients above. That is because the lasso approach regularized and shrunk the least-contributing variables’ coefficients to essentially zero; those additional variables are correspondingly unimportant to predicting the score outcome.
lasso_final_mod %>%
  pull_workflow_fit() %>%
  vip()
From this final model, we can also plot variable importance and see which variables contributed the most and were the most influential in predicting score, matching the patterns we saw in the coefficients above.
Our best-fitting model predicted quite well, with a strong 86% accuracy. Additionally, its ROC AUC, the area under the curve, is also high at 88%, denoting a high probability that the model can distinguish between the score and no-score classes (88% to be exact).
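ROC AUC can be computed without any package as the probability that a randomly chosen scoring drive receives a higher predicted probability than a non-scoring one. A base-R sketch using the rank (Mann-Whitney) formulation on toy predictions (helper name is ours):

```r
# Rank-based ROC AUC on predicted probabilities and 0/1 truth labels.
auc <- function(prob, truth) {
  r <- rank(prob)
  n1 <- sum(truth == 1)
  n0 <- sum(truth == 0)
  (sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc(c(0.9, 0.8, 0.3, 0.1), c(1, 1, 0, 0))  # 1: perfect separation
auc(c(0.9, 0.2, 0.8, 0.1), c(1, 0, 0, 1))  # 0.5: no better than chance here
```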
lasso_test <- lasso_final_wf %>%
last_fit(drive_two_min_split)
lasso_test %>%
collect_metrics() %>%
kbl() %>%
kable_paper("striped") %>%
add_header_above(c("Lasso Log Reg Best Model Accuracy and ROC" = 4))
**Lasso Log Reg Best Model Accuracy and ROC**

| .metric | .estimator | .estimate | .config |
|---|---|---|---|
| accuracy | binary | 0.8630137 | Preprocessor1_Model1 |
| roc_auc | binary | 0.8843478 | Preprocessor1_Model1 |
We then fit this final model one last time on our original train/test split via last_fit(), and again collect the important metrics from the final model we built all the way through the lasso process.
We can dig a bit deeper into how well the model predicts by looking at specific instances from our testing set. Along with the predicted class and actual class, you can also see the probability associated with each class that leads to the prediction, with 0.5 as the threshold.
collect_predictions(lasso_test) %>%
kbl() %>%
kable_paper("striped") %>%
add_header_above(c("Lasso Log Reg Model Predictions" = 7))
**Lasso Log Reg Model Predictions**

| id | .pred_0 | .pred_1 | .row | .pred_class | score | .config |
|---|---|---|---|---|---|---|
| train/test split | 0.9248411 | 0.0751589 | 2 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.9701972 | 0.0298028 | 4 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.0432723 | 0.9567277 | 7 | 1 | 0 | Preprocessor1_Model1 |
| train/test split | 0.9555767 | 0.0444233 | 11 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.8839128 | 0.1160872 | 13 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.9235451 | 0.0764549 | 19 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.0173215 | 0.9826785 | 20 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.9252537 | 0.0747463 | 22 | 0 | 1 | Preprocessor1_Model1 |
| train/test split | 0.0072199 | 0.9927801 | 29 | 1 | 0 | Preprocessor1_Model1 |
| train/test split | 0.4299063 | 0.5700937 | 31 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.0325675 | 0.9674325 | 39 | 1 | 0 | Preprocessor1_Model1 |
| train/test split | 0.9396011 | 0.0603989 | 40 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.7568316 | 0.2431684 | 41 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.0169140 | 0.9830860 | 54 | 1 | 0 | Preprocessor1_Model1 |
| train/test split | 0.0056706 | 0.9943294 | 57 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.9995023 | 0.0004977 | 62 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.8934304 | 0.1065696 | 63 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.1507618 | 0.8492382 | 70 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.9980552 | 0.0019448 | 71 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.1290671 | 0.8709329 | 79 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.9844911 | 0.0155089 | 82 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.6382677 | 0.3617323 | 83 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.0121763 | 0.9878237 | 90 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.0636521 | 0.9363479 | 96 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.8925979 | 0.1074021 | 98 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.5141586 | 0.4858414 | 105 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.5187826 | 0.4812174 | 110 | 0 | 1 | Preprocessor1_Model1 |
| train/test split | 0.9991532 | 0.0008468 | 114 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.5866608 | 0.4133392 | 115 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.0762273 | 0.9237727 | 131 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.9922189 | 0.0077811 | 134 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.3317259 | 0.6682741 | 135 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.0551177 | 0.9448823 | 137 | 1 | 0 | Preprocessor1_Model1 |
| train/test split | 0.7191612 | 0.2808388 | 141 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.9676352 | 0.0323648 | 147 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.8878541 | 0.1121459 | 148 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.2872720 | 0.7127280 | 154 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.0303182 | 0.9696818 | 157 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.9954687 | 0.0045313 | 165 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.9343015 | 0.0656985 | 167 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.6338601 | 0.3661399 | 175 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.7780076 | 0.2219924 | 176 | 0 | 1 | Preprocessor1_Model1 |
| train/test split | 0.9649654 | 0.0350346 | 180 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.8478255 | 0.1521745 | 183 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.9979944 | 0.0020056 | 191 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.9812421 | 0.0187579 | 196 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.0151430 | 0.9848570 | 198 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.8307942 | 0.1692058 | 201 | 0 | 1 | Preprocessor1_Model1 |
| train/test split | 0.5495523 | 0.4504477 | 202 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.9781594 | 0.0218406 | 203 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.2875578 | 0.7124422 | 205 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.9842568 | 0.0157432 | 207 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.8430698 | 0.1569302 | 208 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.0658695 | 0.9341305 | 221 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.9852418 | 0.0147582 | 223 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.0077908 | 0.9922092 | 231 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.2298614 | 0.7701386 | 232 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.0492719 | 0.9507281 | 233 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.9974937 | 0.0025063 | 238 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.9607829 | 0.0392171 | 240 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.3483177 | 0.6516823 | 241 | 1 | 0 | Preprocessor1_Model1 |
| train/test split | 0.7627011 | 0.2372989 | 246 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.9881403 | 0.0118597 | 250 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.8605305 | 0.1394695 | 256 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.9309724 | 0.0690276 | 257 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.9972081 | 0.0027919 | 263 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.9957033 | 0.0042967 | 268 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.4231486 | 0.5768514 | 270 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.9685365 | 0.0314635 | 271 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.9792166 | 0.0207834 | 276 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.4849208 | 0.5150792 | 283 | 1 | 1 | Preprocessor1_Model1 |
| train/test split | 0.9195004 | 0.0804996 | 286 | 0 | 0 | Preprocessor1_Model1 |
| train/test split | 0.9932594 | 0.0067406 | 288 | 0 | 0 | Preprocessor1_Model1 |
This section uses the finalized, fitted model from the previous chunk to predict on the testing data established earlier. This is a very useful view, as you can see the details of each drive: its actual outcome, its predicted outcome, and the class probabilities behind that prediction.
We can also look at our predictions in the aggregate with a confusion matrix.
preds <-
collect_predictions(lasso_test)
conf_mat(preds, truth = score, estimate = .pred_class)
## Truth
## Prediction 0 1
## 0 44 4
## 1 6 19
With these predictions from the fit above, we can see the final model's aggregate performance: the confusion matrix displays how many cases were predicted each way and how accurate those predictions were.
Other important model metrics are laid out in the table below, with these metrics also proving how well the model predicts.
Metrics to Note (by default, yardstick treats the first factor level, here 0, as the event):
- Sens (Sensitivity): the proportion of actual events that were classified as events.
- Spec (Specificity): the proportion of actual non-events that were classified as non-events.
- Precision: the proportion of cases classified as events that actually were events.
- F meas (F-measure): the harmonic mean of precision and sensitivity, indicating overall classification strength.
custom_metrics <- metric_set(accuracy, sens, spec, precision, f_meas)
custom_metrics(preds, truth = score,
estimate = .pred_class) %>%
kbl() %>%
kable_paper("striped") %>%
add_header_above(c("Lasso Log Reg Finalized Model Metrics" = 3))
**Lasso Log Reg Finalized Model Metrics**

| .metric | .estimator | .estimate |
|---|---|---|
| accuracy | binary | 0.8630137 |
| sens | binary | 0.8800000 |
| spec | binary | 0.8260870 |
| precision | binary | 0.9166667 |
| f_meas | binary | 0.8979592 |
This code does a lot: it grabs very important metrics from the finalized, fitted model, specifically from its predictions on the testing data. It uses score as the truth and the predicted class as the estimate, checking where the two match and where they don't. From that matching, it produces key metrics describing how the model predicted the positive case, the negative case, and all cases in general.
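As a sanity check, the values in the metrics table above can be reproduced by hand from the confusion-matrix counts (a quick base-R sketch; recall that yardstick treats the first factor level, 0, as the event):

```r
# Confusion-matrix cell counts, with class 0 treated as the event.
tp <- 44  # truth 0, predicted 0
fn <- 6   # truth 0, predicted 1
fp <- 4   # truth 1, predicted 0
tn <- 19  # truth 1, predicted 1

accuracy  <- (tp + tn) / (tp + fn + fp + tn)            # 0.8630137
sens      <- tp / (tp + fn)                             # 0.88
spec      <- tn / (tn + fp)                             # 0.8260870
precision <- tp / (tp + fp)                             # 0.9166667
f_meas    <- 2 * precision * sens / (precision + sens)  # 0.8979592
```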
The relationship between sensitivity and specificity can be seen below. The dotted diagonal line is what the ROC curve would look like if the model predicted by a random 50/50 choice; the further our curve sits from that line, the better the model, as we see here.
preds %>%
roc_curve(truth = score, .pred_1) %>%
autoplot()
Here we plot the ROC curve to check the model's validity, which is supported by the curve sitting very far from the dotted 50/50 line. This specifically looks at predictions of the positive class, a score (score == 1).
Below is a layout of which probabilities in this model resulted in which predictions. You can see the impact of the 0.5 threshold: hardly any predictions of 1 occur where the probability of class 1 is below 0.5, and they jump once that probability crosses 0.5.
preds %>%
ggplot() +
geom_density(aes(x = .pred_1, fill = score),
alpha = 0.5)
This chunk continues to use the predictions from above to visually show how each class's predictions cluster around certain probabilities, with 0.5 as the significant threshold. This primarily focuses on the positive case of a score.
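To make the 0.5 cutoff concrete: a class prediction is simply the event probability cut at the threshold. A tiny sketch with hypothetical probabilities (not model output):

```r
# Hypothetical probabilities of class 1 (a score), not model output.
probs <- c(0.07, 0.57, 0.48, 0.99)

# The predicted class is just whether the probability clears 0.5.
pred_class <- ifelse(probs > 0.5, 1, 0)
pred_class
# 0 1 0 1
```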
With this strong, accurate model in hand, we put it to the test on a famous drive from the College Football Playoff.
drive_two_min_training
We peek at our training set again to prep our new test data below…
Predicting with our model…
(2) Clemson vs (1) Alabama, CFP National Championship: Jan 9, 2017
Alabama leads 31-28 with 2:07 left in the 4th quarter. Clemson has the ball on its own 32, needing to drive 68 yards to win the game. Clemson drives all 68 yards, capped by a Deshaun Watson pass to Hunter Renfrow for the game-winning TD with 1 second left.
How does our best model, the lasso logistic regression, predict this drive?
Data was obtained from ESPN's cache of play-by-play data. Data unavailable at the time, like game QBR, was replaced with that player's season average for the metric.
watson_to_renfrow <- tribble(~qbr_raw, ~qbr_total, ~pass_tot_yds, ~tot_yds, ~ydsnet, ~rush_yds_tot, ~completion_perc, ~run_plays, ~pass_plays, ~drive_yards_penalized, ~drive_game_clock_start, ~yards_to_go_start, ~score,
104.5, 104.5, 60, 68, 68, 1, 0.6667, 1, 9, 7, 127, 68, 1)
watson_to_renfrow %>%
kbl() %>%
kable_paper("striped") %>%
add_header_above(c("Lasso Log Reg New QB Entry" = 13))
**Lasso Log Reg New QB Entry**

| qbr_raw | qbr_total | pass_tot_yds | tot_yds | ydsnet | rush_yds_tot | completion_perc | run_plays | pass_plays | drive_yards_penalized | drive_game_clock_start | yards_to_go_start | score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 104.5 | 104.5 | 60 | 68 | 68 | 1 | 0.6667 | 1 | 9 | 7 | 127 | 68 | 1 |
Here we create our new data set representing the exact drive and scenario listed above and pipe in to display it.
predict(lasso_final_mod, new_data = watson_to_renfrow) %>%
kbl() %>%
kable_paper("striped") %>%
add_header_above(c("Lasso Log Reg New QB Entry - Prediction" = 1))
**Lasso Log Reg New QB Entry - Prediction**

| .pred_class |
|---|
| 1 |
We use our finalized, fitted model to take in this drive as new data, just as we did with the testing data, and output the prediction.
You can see our model predicts a score correctly!
For the reasons above, we feel very comfortable with the highly accurate and interpretable lasso logistic regression. Still, to ensure we weren't missing a home run with any other model type, we also explored modeling with two different decision-tree applications.
The next model we will briefly touch on is tree-based: a random forest. A random forest fits many decision trees, each built from a series of splits on the most informative variables, and combines the trees' predictions into one.
set.seed(2)
drive_two_min_split <- initial_split(drive_summary_data,
prop = .75, strata = score)
drive_two_min_training <- training(drive_two_min_split)
drive_two_min_testing <- testing(drive_two_min_split)
cv_split <- vfold_cv(drive_two_min_training,
v = 5)
We set the seed for reproducibility and split the data the same way we did for the lasso: the same training/testing split and the same 5 cross-validation folds.
rf_recipe <- recipe(score ~ .,
data = drive_two_min_training) %>%
step_upsample(score, over_ratio = 1)
Our recipe for the random forest is much more basic, containing only the upsampling we need for the aforementioned reasons.
set.seed(2)
rf_model <- rand_forest(mtry = tune(),
min_n = tune(),
trees = 100) %>%
set_mode("classification") %>%
set_engine("ranger")
We set up our random forest model, again in classification mode with the appropriate ranger engine. We set trees at a standard 100 and prep to tune min_n (the minimum number of data points a node must contain for it to be split further) and mtry (the number of predictors randomly sampled at each split).
rf_workflow <- workflow() %>%
add_recipe(rf_recipe) %>%
add_model(rf_model)
Here we establish the workflow by combining recipe and model.
rf_penalty_grid <- grid_regular(
finalize(mtry(), drive_two_min_training %>% select(-score)),
min_n(),
levels = 3)
rf_tune <-
rf_workflow %>%
tune_grid(
resamples = cv_split,
grid = rf_penalty_grid,
control = control_stack_grid())
We create a tuning grid for our parameters and apply it, together with our CV splits and stacking control, to tune toward desirable mtry and min_n values.
Performing a similar process as above, we pull the overall accuracy and ROC AUC from our best random forest model. We see a solidly high 84% accuracy rate and a 91% ROC AUC.
rf_tune %>%
collect_metrics(metric = "accuracy") %>%
filter(.config == "Preprocessor1_Model3") %>%
kbl() %>%
kable_paper("striped") %>%
add_header_above(c("Decision Tree Random Forest Best Model" = 8))
**Decision Tree Random Forest Best Model**

| mtry | min_n | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|---|
| 12 | 2 | accuracy | binary | 0.8427061 | 5 | 0.0167620 | Preprocessor1_Model3 |
| 12 | 2 | roc_auc | binary | 0.9053316 | 5 | 0.0175032 | Preprocessor1_Model3 |
We pipe in to grab the best accuracy and ROC AUC and see the best values for our tuning parameters.
rf_tune %>%
select_best(metric = "accuracy") %>%
kbl() %>%
kable_paper("striped") %>%
add_header_above(c("Decision Tree Random Forest Best Model" = 3))
**Decision Tree Random Forest Best Model**

| mtry | min_n | .config |
|---|---|---|
| 12 | 2 | Preprocessor1_Model3 |
We then pipe in to home in on the specific best tuning parameter values.
rf_tune %>%
collect_metrics() %>%
filter(.metric == "accuracy") %>%
ggplot(aes(x = min_n, y = mean)) +
geom_point() +
geom_line() +
labs(x = "Min Node Size (min_n)", y = "accuracy")
We now visualize the min_n tune to confirm we have the value with the best accuracy (which we do, at min_n = 2).
rf_tune %>%
collect_metrics() %>%
filter(.metric == "accuracy") %>%
ggplot(aes(x = mtry, y = mean)) +
geom_point() +
geom_line() +
labs(x = "Mtry", y = "accuracy")
We now visualize the mtry tune to confirm we have the value with the best accuracy (which we do, at mtry = 12).
The accuracy chart looks choppier here because these tuning parameters govern how each tree decides its splits, including the minimum node size. You can see from our selected best model that the optimized min_n value of 2 and mtry value of 12 represent the peak accuracy.
This model was lacking in interpretability, in the sense that we wanted a more standard regression output like the logistic regression above provides.
Still, it is another model option for future work.
Similar to the previous model we fit, a boosted decision tree creates a series of splits, or branches. The key difference with this type of model is that each tree is NOT independent of the others: the boosted model takes information from the prior trees' errors and uses it to build the next splits.
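That sequential, error-correcting idea can be sketched on toy data. The following is a minimal base-R illustration of boosting with depth-1 "stumps" on a 1-D regression problem, not the xgboost algorithm itself:

```r
# Each stump (a single mean split) is fit to the residuals left by
# the ensemble so far, so every new tree corrects its predecessors.
set.seed(1)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)

fit_stump <- function(x, r) {
  # best single split on x minimizing squared error of residuals r
  splits <- quantile(x, probs = seq(0.1, 0.9, 0.1))
  best <- NULL; best_sse <- Inf
  for (s in splits) {
    left <- r[x <= s]; right <- r[x > s]
    sse <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(split = s, left = mean(left), right = mean(right))
    }
  }
  best
}

learn_rate <- 0.1
pred <- rep(0, length(y))
for (i in 1:100) {
  r <- y - pred                      # residuals of current ensemble
  st <- fit_stump(x, r)
  step <- ifelse(x <= st$split, st$left, st$right)
  pred <- pred + learn_rate * step   # shrunken correction per round
}

# The boosted ensemble's error ends up well below a constant fit's.
mean((y - pred)^2)
```

The learn_rate shrinkage is the same role the tuned learn_rate plays in the xgboost specification below: smaller values mean each tree contributes less, requiring more trees but usually generalizing better.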
xgboost_spec <-
boost_tree(
trees = 1000,
min_n = 5,
tree_depth = 2,
learn_rate = tune(),
loss_reduction = 10^-5,
sample_size = 1) %>%
set_mode("classification") %>%
set_engine("xgboost")
xgboost_recipe <- recipe(formula = score ~ ., data = drive_two_min_training) %>%
step_upsample(score, over_ratio = 1) %>%
step_mutate_at(all_numeric(),
fn = ~as.numeric(.)) %>%
step_novel(all_nominal_predictors()) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
step_zv(all_predictors())
xgboost_workflow <-
workflow() %>%
add_recipe(xgboost_recipe) %>%
add_model(xgboost_spec)
set.seed(2)
registerDoParallel()
boost_penalty_grid <- grid_regular(
learn_rate(),
levels = 10)
boost_tune <- xgboost_workflow %>%
tune_grid(
resamples = cv_split,
grid = boost_penalty_grid,
control = control_stack_grid())
Here we first specify the model and the mode we plan to use it for (classification). We set arguments such as the number of trees (1000), the maximum tree depth (2), and the minimum number of data points a node needs before it can be split further (5). Next, we write the recipe for the model, upsampling just as in the two previous models. As before, we put the recipe and model specification into a workflow and set up the tuning grid, here over the learning rate. Finally, we tune the model using our cross-validation splits.
After finding the best-fit boosted model, we can again pull its overall accuracy. Our boosted model shows an accuracy rate of 80.6% with an ROC AUC of 89%, in the same range as the random forest.
boost_tune %>%
collect_metrics(metric = "accuracy") %>%
filter(.config == "Preprocessor1_Model10") %>%
kbl() %>%
kable_paper("striped") %>%
add_header_above(c("Decision Tree Boosted Best Accuracy Models" = 7))
**Decision Tree Boosted Best Accuracy Models**

| learn_rate | .metric | .estimator | mean | n | std_err | .config |
|---|---|---|---|---|---|---|
| 0.1 | accuracy | binary | 0.8058140 | 5 | 0.0164032 | Preprocessor1_Model10 |
| 0.1 | roc_auc | binary | 0.8900178 | 5 | 0.0199485 | Preprocessor1_Model10 |
boost_tune %>%
select_best(metric = "accuracy") %>%
kbl() %>%
kable_paper("striped") %>%
add_header_above(c("Decision Tree Boosted Best Model" = 2))
**Decision Tree Boosted Best Model**

| learn_rate | .config |
|---|---|
| 0.01 | Preprocessor1_Model09 |
Here we use the tuned model to show the accuracy rate of the best model, creating another kable table to show accuracy and ROC AUC.
Again, like the random forest model we created, the boosted model is hard to interpret, and it is actually less accurate than the logistic regression we fit first. It would be much more beneficial to have a model in which we can view each variable and understand how it affects the final prediction; the boosted model lacks that trait.
We will take a peek at some overall statistics from all three models to compare them across the board.
lasso_tune %>%
collect_predictions() %>%
group_by(id, penalty) %>%
summarize(accuracy = sum((score == .pred_class))/n(),
true_neg_rate = sum(score == 0 & .pred_class == 0)/sum(score == 0),
true_pos_rate = sum(score == 1 & .pred_class == 1)/sum(score == 1)) %>%
group_by(penalty) %>%
summarize(across(accuracy:true_pos_rate, mean)) %>%
kbl() %>%
kable_paper("striped") %>%
add_header_above(c("Lasso Models Performance" = 4))
**Lasso Models Performance**

| penalty | accuracy | true_neg_rate | true_pos_rate |
|---|---|---|---|
| 0.0000000 | 0.8291755 | 0.8603297 | 0.7812745 |
| 0.0000000 | 0.8291755 | 0.8603297 | 0.7812745 |
| 0.0000000 | 0.8291755 | 0.8603297 | 0.7812745 |
| 0.0000002 | 0.8291755 | 0.8603297 | 0.7812745 |
| 0.0000028 | 0.8291755 | 0.8603297 | 0.7812745 |
| 0.0000359 | 0.8291755 | 0.8603297 | 0.7812745 |
| 0.0004642 | 0.8338266 | 0.8680220 | 0.7812745 |
| 0.0059948 | 0.8384778 | 0.8623077 | 0.8180392 |
| 0.0774264 | 0.8198732 | 0.8049588 | 0.8599020 |
| 1.0000000 | 0.5640592 | 0.8000000 | 0.2000000 |
We pipe into our lasso tune here to grab its predictions and look at overall performance. We define how to calculate accuracy, the true-negative rate (specificity), and the true-positive rate (sensitivity), then produce those metrics grouped by each candidate penalty. This lets us judge performance across many settings, not just from one metric.
Avg_Accuracylr <- (0.8291755 + 0.8291755 + 0.8291755 + 0.8291755 + 0.8291755 + 0.8291755 + 0.8338266 + 0.8384778 + 0.8198732 + 0.5640592) / 10
Avg_Accuracylr
## [1] 0.803129
Log Reg Avg Accuracy = 0.803129
rf_tune %>%
collect_predictions() %>%
group_by(id, mtry, min_n) %>%
summarize(accuracy = sum((score == .pred_class))/n(),
true_neg_rate = sum(score == 0 & .pred_class == 0)/sum(score == 0),
true_pos_rate = sum(score == 1 & .pred_class == 1)/sum(score == 1)) %>%
group_by(mtry, min_n) %>%
summarize(across(accuracy:true_pos_rate, mean)) %>%
kbl() %>%
kable_paper("striped") %>%
add_header_above(c("Decision Tree -- Random Forest Models Performance" = 5))
**Decision Tree -- Random Forest Models Performance**

| mtry | min_n | accuracy | true_neg_rate | true_pos_rate |
|---|---|---|---|---|
| 1 | 2 | 0.8058140 | 0.8908381 | 0.6390196 |
| 1 | 21 | 0.8195560 | 0.8853839 | 0.6843672 |
| 1 | 40 | 0.8007400 | 0.8369037 | 0.7227986 |
| 6 | 2 | 0.8242072 | 0.8776916 | 0.7161854 |
| 6 | 21 | 0.8059197 | 0.8427088 | 0.7312299 |
| 6 | 40 | 0.7779070 | 0.8122962 | 0.7210339 |
| 12 | 2 | 0.8427061 | 0.8803746 | 0.7712299 |
| 12 | 21 | 0.8195560 | 0.8512910 | 0.7661319 |
| 12 | 40 | 0.7963002 | 0.8000396 | 0.8078966 |
We pipe into our rf tune here to grab its predictions and look at overall performance. We define how to calculate accuracy, the true-negative rate (specificity), and the true-positive rate (sensitivity), then produce those metrics grouped by each mtry and min_n combination. This lets us judge performance across many settings, not just from one metric.
Avg_Accuracyrf <- (0.8058140 + 0.8195560 + 0.8007400 + 0.8242072 + 0.8059197 + 0.7779070 + 0.8427061 + 0.8195560 + 0.7963002) / 9
Avg_Accuracyrf
## [1] 0.8103007
Random Forest Avg Accuracy = 0.8103007
boost_tune %>%
collect_predictions() %>%
group_by(id, learn_rate) %>%
summarize(accuracy = sum((score == .pred_class))/n(),
true_neg_rate = sum(score == 0 & .pred_class == 0)/sum(score == 0),
true_pos_rate = sum(score == 1 & .pred_class == 1)/sum(score == 1)) %>%
group_by(learn_rate) %>%
summarize(across(accuracy:true_pos_rate, mean)) %>%
kbl() %>%
kable_paper("striped") %>%
add_header_above(c("Decision Tree -- Boosted Tree Models Performance" = 4))
**Decision Tree -- Boosted Tree Models Performance**

| learn_rate | accuracy | true_neg_rate | true_pos_rate |
|---|---|---|---|
| 0e+00 | 0.6899577 | 1.0000000 | 0.0000000 |
| 0e+00 | 0.7313953 | 0.7341993 | 0.7378431 |
| 0e+00 | 0.7223044 | 0.7066131 | 0.7645098 |
| 1e-07 | 0.7223044 | 0.7066131 | 0.7645098 |
| 1e-06 | 0.7223044 | 0.7066131 | 0.7645098 |
| 1e-05 | 0.7223044 | 0.7066131 | 0.7645098 |
| 1e-04 | 0.7223044 | 0.7066131 | 0.7645098 |
| 1e-03 | 0.7500000 | 0.7014022 | 0.8745098 |
| 1e-02 | 0.8244186 | 0.8466325 | 0.7813725 |
| 1e-01 | 0.8058140 | 0.8447976 | 0.7165241 |
We pipe into our boost tune here to grab its predictions and look at overall performance. We define how to calculate accuracy, the true-negative rate (specificity), and the true-positive rate (sensitivity), then produce those metrics grouped by each learning rate. This lets us judge performance across many settings, not just from one metric.
Avg_Accuracybb <- (0.6899577 + 0.7313953 + 0.7223044 + 0.7223044 + 0.7223044 + 0.7223044 + 0.7223044 + 0.7500000 + 0.8244186 + 0.8058140) / 10
Avg_Accuracybb
## [1] 0.7413108
Boosted Trees Avg Accuracy = 0.7413108
All of the models achieve reasonably high average accuracies, with the logistic regression and random forest roughly equal at around 80-81% and the boosted trees trailing slightly.
Factoring in our desire for interpretable coefficients, and the higher true-positive rate in the log reg output, we have much more confidence in our selection of the logistic regression model and stand by our analysis as appropriate.
By cleaning play-by-play data and creating summary statistics for each drive that occurred under two minutes, we were able to successfully analyze the importance of the 2-minute Drill in the game of football. Not only did we look into the players that are the "clutchest" under 2 minutes, we also created our own model to correctly classify the result of a 2-minute drill drive. From there, we could identify the most important variables and determine what matters most for a team trying to end its drive with a score.
We show that QBR and the number of penalties a team takes during the drive are the most important team-controllable variables. However, if you want to predict whether a team will end its drive in a score, you should look first at the number of yards the team has to drive.
Looking further into each 2-minute drill could lead to even more useful insights. One downfall of our format is that we did not create enough variables to summarize each drive; there could be many more drive-specific characteristics that affect the models we created, such as whether the driving team is home or away and the weather during the game. Nonetheless, we decided to focus only on the on-field aspects of the drive, especially quarterback play.
We see no potential for harm from this analysis. We were very open about how we performed it, and the data and goal of the project are benign in scope. We see no negative uses of this data, only possible upside from greater reliance on stats: better-equipped teams and offenses will lead to more in-game action, giving NFL fans what they desire. The only negative impact may fall on defensive specialist players.